Efficient GPU Programming


Making the most of the pipeline! 

■ Optimizations within the IA software stack 

■ Application specific 

■ Generic 

■ Greatest impact from application optimization 

■ Meet your friendly AE! 


Tooltip! 


- Use GPA™ to find and optimize GPU hotspots 


Scope of Optimizations 



Low 


Big Picture! 


Draw...Frame 


Draw 





Draw Dispatching and Resource Update 


Be conscious of memory access patterns of dispatched operations

■ 3D / 2D operation scheduling


Large Surfaces (high latency) 


Resource A 


Resource B 


Resource A 
Draw 0


Draw 1 


Draw 2 

Vs. 



Resource A 


Resource A 


Resource B 
■ State / shader changes

■ Resource locality 


Small Surfaces (low latency) 




Write to same RT region 


Draw 0 


Draw 1 (2) 


Draw 2 (1) 
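
The ordering contrast in the diagram above (Resource A, B, A versus A, A, B) can be sketched in a few lines. This is only an illustration, not code from the talk: the `Draw` record and both helper functions are hypothetical names, and the sketch assumes the draws are independent, so reordering them does not change the rendered result.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Draw:
    name: str      # hypothetical label, e.g. "Draw 0"
    resource: str  # hypothetical key for the large surface the draw touches


def count_resource_switches(draws):
    """Count how often two consecutive draws touch different resources."""
    return sum(1 for a, b in zip(draws, draws[1:]) if a.resource != b.resource)


def group_by_resource(draws):
    """Stable reorder so draws that share a resource become adjacent.

    Resources keep their first-seen order, and draws keep their original
    relative order inside each group (sorted() is stable).
    """
    first_seen = {}
    for d in draws:
        first_seen.setdefault(d.resource, len(first_seen))
    return sorted(draws, key=lambda d: first_seen[d.resource])


submitted = [Draw("Draw 0", "A"), Draw("Draw 1", "B"), Draw("Draw 2", "A")]
grouped = group_by_resource(submitted)

print(count_resource_switches(submitted), count_resource_switches(grouped))
```

Grouping is only safe when draw order does not affect the output (for example opaque geometry with depth testing); blended draws that write to the same render-target region usually have to keep their submission order.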


































Platform 

More than just the sum of its parts...



Platform 


Graphics is only part of the puzzle 

■ Unique architecture characteristics 

■ Power & performance 

■ Memory hierarchy 

■ Paired platform 

■ CPU 

■ System memory 

■ Other constraints 

■ Thermal 

■ Power 


Platform 



Un-core 










CPU Optimization 


Relationship between CPU / GPU 

■ CPU or GPU bottleneck 

■ CPU can limit GPU 

■ Whaaa?.... 


Tooltip! 


- Use VTune™ to find and optimize CPU hotspots 
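
A rough way to reason about the CPU / GPU relationship above, assuming CPU and GPU work for a frame overlap completely so the slower side sets the frame time. The function names and the 10% "balanced" tolerance are assumptions made for this sketch; real measurements would come from a profiler such as the tools named in the tooltips.

```python
def frame_time_ms(cpu_ms, gpu_ms):
    """Frame time when CPU and GPU work overlap fully: the slower side wins."""
    return max(cpu_ms, gpu_ms)


def classify_bottleneck(cpu_ms, gpu_ms, tolerance=0.10):
    """Return "CPU", "GPU", or "balanced" for one frame.

    tolerance is an assumed threshold: if the two times differ by no more
    than this fraction of the slower one, the frame is called balanced.
    """
    slower = max(cpu_ms, gpu_ms)
    if slower <= 0:
        raise ValueError("frame times must be positive")
    if abs(cpu_ms - gpu_ms) <= tolerance * slower:
        return "balanced"
    return "CPU" if cpu_ms > gpu_ms else "GPU"


print(classify_bottleneck(20.0, 10.0), frame_time_ms(20.0, 10.0))
```

When the result is "CPU", GPU-side optimizations will not shorten the frame, which is the sense in which the CPU can limit the GPU.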


Power (15W Total)

[Chart legend: CPU, GPU, Un-core]

Frequency (GPU Peak 1.15 GHz)

[Chart legend: CPU, GPU]
Cache Locality Is King


Optimize memory accesses for both CPU and GPU 

■ Memory bandwidth bound 


DRAM 


■ Hierarchy varies with platform 

■ Optional CPU + GPU Caches 

- Last Level Cache (LLC) 

- Embedded DRAM (eDRAM) 

■ GPU 















IA Graphics 


This is what you came for right? 
Architecture


Architectural components 

■ Non-Slice 

■ Fixed function 

- Transformation 

- Clipping 

■ Slice 

■ Slice common 

- Rasterization 

- Shader dispatch 

- Color back-end 

■ Sub-slice(s) 

- Shader execution 


Non-Slice 


Fixed Function 


Slice 



Shader Execution 


Memory Interface 













Architecture Scaling 


Scaling Components 

■ Slice 

■ Parallel primitive processing 

■ Sub-slice 

■ Parallel span processing 



Slice Scaling (1 - N Slices) 


Slice 



Sub-Slice Scaling (1 - N Sub-Slices) 




















Sampler 


1 Sampler Per Sub-Slice 

■ Local texture cache (Tex$) 

■ Backed by common L3$



Non-Slice 


Fixed Function 


Slice 



Memory Interface 













Sampler Performance 


Remember Cache Locality? :)

■ Throughput 

■ Format 

■ Sampling pattern 

■ Poor access pattern 

■ Increased memory b/w 

■ Increased latency 



[Figure legend: Good vs. Bad access pattern]
Texture Compression


Utilize as much as possible! 

■ Offline compression 

■ Dynamic compression 



Original Surface 





Fillrate 


Per Slice-Common 

■ Pixel Back-End 

■ Color Cache (RCC$) 


Ex. Synthetic Fillrate vs. Slice Count 



Non-Slice 


Fixed Function 


Sampler 


Tex$ 



Sampler 


Memory Interface 











Fillrate Performance


Pumping out color 

■ Throughput 

■ Format 

■ Dimension + region 

■ Other factors 

■ Rasterization 

■ Early Z/STC 

■ Pixel Shader Execution 

■ Late Z/STC 

■ Blend function + mode 



[Chart axis: Sub-Slices (1, 2, ... N)]


Surface Format 


Select the appropriate format for color range 

■ Intermediate / final render targets 

■ Cause 

■ Higher precision format chosen un-necessarily 

■ Effect 

■ Reduced fill rate 

■ Increased memory bandwidth 



HDR (R16G16B16A16) 


HDR (R10G10B10A2) 
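
The bandwidth cost of the two HDR formats named above follows directly from their channel widths. A small worked calculation, assuming a 1920x1080 render target (the resolution is an assumption for illustration, not a figure from the talk):

```python
# Channel bit widths taken from the format names themselves.
FORMAT_BITS = {
    "R16G16B16A16": (16, 16, 16, 16),
    "R10G10B10A2": (10, 10, 10, 2),
}


def bytes_per_pixel(fmt):
    """Sum the channel widths and convert bits to bytes."""
    return sum(FORMAT_BITS[fmt]) // 8


def surface_bytes(fmt, width, height):
    """Size of one uncompressed surface in bytes."""
    return bytes_per_pixel(fmt) * width * height


for fmt in FORMAT_BITS:
    print(fmt, bytes_per_pixel(fmt), surface_bytes(fmt, 1920, 1080))
```

Every write to (or read from) the render target moves half as many bytes with R10G10B10A2 as with R16G16B16A16, at the price of less precision per channel and a 2-bit alpha.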





Arithmetic Logic 


Block Per Sub-Slice 

■ Execution Units (EUs) 

■ Instruction Cache (IC$) 



Non-Slice 


Fixed Function 
Memory Interface





















Arithmetic Logic Performance 


Algorithmic Complexity 

■ Control flow 

■ Math 

■ Extended math 

■ Max concurrent registers 



Synthetic Relative Performance - EU Operations 
Shader Optimization


Optimal code based on purpose 

■ Shader scaling 

■ The case of the generic shader 

■ Generation of unused outputs



[Chart: Original vs. Optimized, Cycles]


vs_3_0
def c17, 2, -1, 0, 1
def c18, 1.44269502, 0.00999999978, -1.44269502, 0
dcl_position v0
dcl_normal v1
dcl_color v2
dcl_position o0
dcl_texcoord o1
dcl_texcoord1 o2
dcl_texcoord2 o3.xyz
dcl_texcoord3 o4.xyz
dcl_color o5
dcl_texcoord4 o6
dcl_texcoord5 o7
dcl_texcoord6 o8.xy
mul r0, c5, v0.y
mad r0, v0.x, c4, r0
mad r0, v0.z, c6, r0
mad o0, v0.w, c7, r0
... 76 instructions ...
mov o7, v2



vs_3_0
dcl_position v0
dcl_position o0
mul r0, c5, v0.y
mad r0, v0.x, c4, r0
mad r0, v0.z, c6, r0
mad o0, v0.w, c7, r0
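
The difference between the two listings is that the short one declares and writes only the position output. A minimal sketch of how unused outputs could be identified, assuming the shader is being specialized for a pass that consumes nothing but position (a depth-only pass is an assumed example; the slides do not say which pass the optimized shader serves). The `unused_outputs` helper and the register-to-semantic table are illustrative, transcribed from the output declarations of the longer listing.

```python
# Output registers and semantics declared by the longer (generic) shader.
GENERIC_VS_OUTPUTS = {
    "o0": "position",
    "o1": "texcoord",
    "o2": "texcoord1",
    "o3": "texcoord2",
    "o4": "texcoord3",
    "o5": "color",
    "o6": "texcoord4",
    "o7": "texcoord5",
    "o8": "texcoord6",
}


def unused_outputs(vs_outputs, consumed_semantics):
    """Return output registers whose semantic no later stage consumes.

    Position is always kept because the rasterizer needs it.
    """
    needed = set(consumed_semantics) | {"position"}
    return sorted(reg for reg, sem in vs_outputs.items() if sem not in needed)


# Assumed depth-only pass: the pixel stage reads no interpolated attributes.
print(unused_outputs(GENERIC_VS_OUTPUTS, consumed_semantics=set()))
```

Every register this reports is one the specialized shader no longer has to compute or write, which is where the instruction count drops from the generic version.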


[Chart: Original vs. Optimized, Cache Entries]



Geometry 


Single Non-Slice 

■ Fixed Function 

■ VS 

■ HS 

■ TE 

■ DS 

■ GS 

■ SOL 

■ Clipper 

■ Setup Front-End 
Memory Interface



















Optimizing Geometry for Algorithmic Complexity 


Optimal definitions for a single piece of geometry

■ Quality scaling with platform 

■ Purpose 

■ Lighting, depth, animation... 


Ex. Edge Softening 



Merged Vertex + Normal 




Model with Hard Edges 



After (Soft Edge) 


Model with Soft Edges 



Optimizing Primitive Ordering 


Primitive scheduling within a single draw 

■ Ordering primitives for both locality and latency 

■ Two cases 

■ View dependent 

■ View independent 

■ Sample example (HDA01 0.1) 

■ Primitive dispatch color coded (green -> red) 

■ 2%-13% performance gain
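
For the view-independent case, one common way to compare primitive orderings is the average cache miss ratio (ACMR): vertex-cache misses per triangle. The sketch below assumes a simple FIFO post-transform vertex cache with a hypothetical size; it is a model for comparing orderings, not a description of the actual hardware cache or of the method used for the numbers above.

```python
from collections import deque


def acmr(indices, cache_size=16):
    """Vertex-cache misses per triangle for an indexed triangle list.

    Models a FIFO cache of cache_size entries (an assumed model).
    """
    if not indices or len(indices) % 3 != 0:
        raise ValueError("indices must describe whole triangles")
    cache = deque(maxlen=cache_size)  # a full deque drops its oldest entry
    misses = 0
    for idx in indices:
        if idx not in cache:
            misses += 1
            cache.append(idx)
    return misses / (len(indices) // 3)


# The same four triangles in two submission orders.
strip_order = [0, 1, 2, 1, 3, 2, 2, 3, 4, 3, 5, 4]
scattered_order = [0, 1, 2, 3, 5, 4, 1, 3, 2, 2, 3, 4]

print(acmr(strip_order, cache_size=4), acmr(scattered_order, cache_size=4))
```

With the deliberately tiny 4-entry cache, the strip-like order transforms each of the six vertices once (ACMR 1.5), while the scattered order re-transforms three of them (ACMR 2.25).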



Original Ordering 



View Dependent Ordering 









Scaling Your Game 

Burn baby burn, heat inferno... 


Lost Planet 2 : Images courtesy of Capcom 





Why do you care? 


Wide Range of Platforms + CPU + GPU 

■ Each with unique performance characteristics 

■ All of which the user hopes to run your game 

■ And run it well :)


Better selling point? More platforms + happy users == more money? $$$ :)



How Well Does Your Game Scale? 


■ Created a game 

■ Quality settings 





Memory Bandwidth 


It's all about the memory, baby

■ Will vary greatly with platform 

■ Why do you care? 

■ Read from memory 

■ Write to memory 


Goal 


- Establish memory ceiling (budget) 
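
A memory ceiling can be turned into a per-frame budget with simple arithmetic. The sketch below uses placeholder numbers (24 GB/s of memory bandwidth, 60 frames per second, a 1920x1080 32-bit surface) that are assumptions for illustration, not measurements from the talk; the function names are hypothetical.

```python
def per_frame_budget_bytes(bandwidth_gb_per_s, target_fps, usable_fraction=1.0):
    """Bytes that can move to/from memory in one frame.

    usable_fraction is an assumed knob for bandwidth shared with the CPU
    and the rest of the platform.
    """
    return bandwidth_gb_per_s * 1e9 * usable_fraction / target_fps


def surface_traffic_bytes(width, height, bytes_per_pixel, reads=0, writes=0):
    """Traffic from touching every pixel of a surface reads + writes times."""
    return width * height * bytes_per_pixel * (reads + writes)


budget = per_frame_budget_bytes(24.0, 60)
blend_pass = surface_traffic_bytes(1920, 1080, 4, reads=1, writes=1)

print(budget, blend_pass, blend_pass / budget)
```

Summing such estimates over every pass in a frame and comparing the total against the budget shows how much headroom is left on a given platform.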






Sampler Throughput 


Varies with architecture and platform 

■ Measure all use cases 

■ Dimension 

■ Format 

■ Filtering mode 


Goal 

- Select optimal format & dimension 


Ex. Synthetic Sampler Throughput 32bit Use Cases 



32bit (Point/Bilinear) 
32bit (Trilinear) 
Fill Rate


Multiple surface types 

■ Render target 

■ Format 

■ Dimension 

■ Blended / Non-blended 

■ Depth 

■ Read and/or Write

■ Stencil 

■ Read and/or Write

Goal 


- Understand relative performance 

- Optimal format, dimension, and algorithm 


Synthetic Relative Performance - Fullscreen Primitive 



[Chart categories: RT 32bit, RT 32bit (Blend), Z Fail, Z Pass, STC Test]



Geometry Throughput 


Ex. Geometry Selection Low vs. High Throughput 


Fixed function bandwidth and Arithmetic Logic 

■ Fixed function 

■ Clip / Cull

■ Rasterization 

■ Geometry transformation 

■ ALU 



Goal 


- Optimal geometry and algorithm 


Synthetic Relative Performance - EU Operations 



MAX LRP CMP LOG EXP POW ADD MUL 





THE END 


Conclusion 

Wrapping it all up in a bow...




Looking Forward 


Same game for desktop to phone 

■ Wide array of platforms 

■ Adaptable quality settings 

■ Scaling algorithms 

■ Optimization 


Thanks for attending! 





Questions? 


Contact Information 
■ E-mail : robert.b.taylor@intel.com 
Ready for More? Look Inside™

Keep in touch with us at GDC and beyond: 

• Game Developer Conference 

Visit our Intel® booth #1016 in Moscone South

• Intel University Games Showcase 
Marriott Marquis Salon 7, Thursday 5:30pm 
RSVP at bit.ly/intelgame 

• Intel Developer Forum, San Francisco 
September 9-11, 2014 

intel.com/idf14

• Intel Software Adrenaline 
@inteladrenaline 

• Intel Developer Zone 
software.intel.com 
@intelsoftware 
Up Next...


12:30 - 1:30

Realistic Cloud Rendering using Pixel Synchronization 


Presented by: 
Egor Yusov - Intel 



